Driving singing behaviour in songbirds using a multi-modal, multi-agent virtual environment

Interactive biorobotics provides unique experimental potential to study the mechanisms underlying social communication but is limited by our ability to build expressive robots that exhibit the complex behaviours of birds and small mammals. An alternative to physical robots is to use virtual environments. Here, we designed and built a modular, audio-visual 2D virtual environment that allows multi-modal, multi-agent interaction to study mechanisms underlying social communication. The strength of the system is an implementation based on event processing that allows for complex computation. We tested this system in songbirds, which provide an exceptionally powerful and tractable model system to study social communication. We show that pair-bonded zebra finches (Taeniopygia guttata) communicating through the virtual environment exhibit normal call timing behaviour, males sing female directed song and both males and females display high-intensity courtship behaviours to their mates. These results suggest that the environment provided is sufficiently natural to elicit these behavioral responses. Furthermore, as an example of complex behavioral annotation, we developed a fully unsupervised song motif detector and used it to manipulate the virtual social environment of male zebra finches based on the number of motifs sung. Our virtual environment represents a first step in real-time automatic behaviour annotation and animal–computer interaction using higher level behaviours such as song. Our unsupervised acoustic analysis eliminates the need for annotated training data thus reducing labour investment and experimenter bias.

www.nature.com/scientificreports/ cues play crucial roles in the natural behaviour 25 . Situating a zebra finch in a virtual environment requires at least sound and vision, but the experience is likely also influenced by gaze 29 and orientation relative to other agents 30 . Previous work has shown that zebra finches interact vocally with an immobile physical decoy providing audio from a built-in speaker 10,31 and are physically attracted to more life-like actuated zebra finch robots 15 . Furthermore, adult finches can recognize and discriminate between conspecifics from live video feeds 32,33 and sing song to still images 34 or live video feeds of females 34,35 . Also, juvenile males can learn song from video and audio playback of a tutor 36,37 . Finally, perturbation of virtual auditory environments can drive active error correction of song 38,39 . Taken together, these studies suggest that social vocal behaviour in zebra finches can be studied in a virtual environment. However, to the best of our knowledge, no multi-agent setup currently exists that supports multi-modal manipulation of vocal communication, and we do not know whether zebra finches exhibit normal vocal behaviour when placed in such a virtual environment.
Here we designed and built a multi-agent, audio-visual environment for experimentally manipulating social communication. Our system implementation is based on event processing, which allows for much more complex computation than the state of the art and is a step towards integrating closed-loop control and AI in studying social communication. The latency of the current system was 308 ms and 383 ms for audio and video, respectively. The delay may restrict some interactive behaviours, but still allows a wide range of applications. Before using the system in biological experiments, we first wanted to test whether the audiovisual environment is sufficiently realistic to elicit established behavioural responses in a social context. Therefore, in this paper we quantified several well-described social vocal behaviours in zebra finches communicating through the environment. We show that zebra finches exhibit normal call timing behaviour to their partner, that males sing female-directed song and that both males and females display high-intensity courtship behaviours when communicating through the virtual environment. As an example of exploiting complex computation, we developed a fully unsupervised song motif detector. We subsequently used this detector to manipulate the virtual social environment of male zebra finches, making them sing directed song to a virtual mate. Although we accepted longer system latency, our results suggest that the virtual environment is sufficiently natural to elicit courtship behaviours. Our work represents a first step in using automatic behaviour annotation for animal-computer interaction during higher level behaviours such as song.

Results
We present a modular, audio-visual virtual environment able to experimentally manipulate social communication. Our system allows for multi-modal, multi-agent interaction and focuses on songbirds. Our setup is implemented in a box placed inside a cage and the cage is placed in a sound attenuating isolator box. We record and present a high-speed (60 fps) visual environment through a teleprompter system that allows direct eye contact and ensures a realistic visual perspective of the video (Fig. 1A). The cage has two perches with presence sensors (Fig. 1A); one in front of the teleprompter screen (front perch) and one behind an opaque divider that does not allow visual contact with the screen (back perch). Connecting two setups provides the visual impression that the other animal is located 20 cm away (Fig. 1B). We furthermore record the acoustic environment and present audio from a speaker located behind the teleprompter to provide the cue that sound and video have the same spatial origin. Data from all sensors is streamed on a network, translated into events using cloud-based event processing 40 and is captured for offline processing by a node connected to storage (historian, Fig. 1C). This modular and distributed design allowed for scaling of individual parts of the system (e.g., to add sensors or software analyses) and can be extended to connect multiple setups.
Event Processing 41 was used to represent onset and offset of behavioural features, and event streams from multiple producers were combined to form new events (Fig. 1D). The position of a single bird generated three different events for absence/presence on the front perch, the back perch or another location (Fig. 1D, cyan, yellow and magenta lines). We assumed a visual contact event when both birds perched in front of the screen. To detect and identify vocal signals, the audio stream from the male was analysed in real time (see "Methods" section) to generate events based on power and entropy threshold-crossings (Fig. 1D, green lines). A bout of song was detected by combining power and entropy with hysteresis, thus suppressing most noise. Finally, a directed song (DS) event was generated when bout and contact were active at the same time (Fig. 1D, bottom line); in other words, when a male was singing while both male and female were sitting on the front perch.
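The combination of event streams described above can be read as an interval intersection: contact is the overlap of the two front-perch events, and directed song is the overlap of bout and contact. The following is a minimal Python sketch of that idea, assuming each event is stored as an (onset, offset) pair in seconds. The `intersect` function, the interval representation and the example values are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical representation: one event = (onset, offset) in seconds.
Interval = Tuple[float, float]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the periods during which an event in `a` and an event in `b`
    are active at the same time. Both lists must be sorted by onset and
    must not overlap within themselves."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Illustrative values only.
male_front = [(0.0, 10.0), (20.0, 30.0)]
female_front = [(5.0, 25.0)]
bout = [(8.0, 12.0), (21.0, 23.0)]

contact = intersect(male_front, female_front)   # both birds on the front perch
directed_song = intersect(bout, contact)        # singing during contact
```

In this example the male sings from 8 s to 12 s, but only the first two seconds overlap with contact, so only that part counts as a candidate DS event.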
To investigate the animals' motivation for social interaction through the setup, we measured the perch preference of twelve pair-bonded male-female couples in four different audio-visual modality combinations of speaker/screen on and off. When a modality was on, it was streamed to the other setup with a delay (see "Methods" section). The delays for audio and video were 308 and 383 ms, respectively. Our data showed that birds spent significantly more time on the front perch when one modality from the other bird was on, either sound (two-proportion z-test, 1% significance level, n = 24; z = −6.56, p = 0.0) or video (z = −10.10, p = 0.0) (Fig. 1E). When both video and audio modalities were on, birds also spent more time on the front perch than with video only (z = −2.31, p = 0.0105) but not when compared to live audio (z = 1.23, p = 0.8910). This demonstrates that the birds were attracted to both audio and visual signals of another individual supplied by the system.
To demonstrate that the setup provided a sufficiently natural social environment, we exploited two key behavioural responses in communication between pair-bonded individuals: call timing and directed song. Coordinated call production between partners is a well-described behaviour in birds, where it is thought to influence pair-bond maintenance and mate guarding 42 . Zebra finches show time-locked call behaviour using two types of calls: tet and stack calls 26 . Both are short, low-power vocalizations used when the birds are physically close together 25,28 . We quantified the call timing between established pair-bonded couples communicating through our virtual environment (Fig. 2A) and identified calls using a supervised random forest classifier (see "Methods" section). With both audio and visual modalities on, the delay from hearing a call to producing one was unimodally  27 ). Because calls were synchronized and contingent on the call of the mate, we conclude that the birds displayed natural call timing behaviour through the virtual environment. Next, we studied whether males exhibited natural singing behaviour to their virtual mate. Male zebra finches sing both to females (directed song, DS) and without directing the song towards any particular conspecific (undirected song, US) 25 . The song consists of introductory notes and a stereotyped sequence of syllables, called the motif, that is often repeated several times to form a song bout 25,43 . Although the DS and US motifs consist of the same syllable sequence, several key acoustic features differ between DS and US. The DS motif is delivered faster and is preceded by more introductory notes 44 . It also has more repetitions of the motif in each bout and increased sequence stereotypy 43 , and DS syllables exhibit less variation in the fundamental frequency (FF) of harmonic stacks 45 .
We studied five established pair-bonded couples communicating through our environment and isolated candidate DS events as the simultaneous occurrence of bout and contact events, i.e., when both animals were perched in front of the screen and the male was vocalizing. The video segments of potential DS events were subsequently scored for accompanying behaviour by experienced observers (IA, CPHE). All (5/5) males sang directed song to their virtual mates and displayed courtship behaviours indicative of DS, such as fluffing, beak wipes and jumping, at high intensity (Fig. 3A). The coefficient of variation (CV) of the fundamental frequency (Fig. 3B,C) was significantly lower (Wilcoxon signed rank test, 1% significance level, n = 5, W = 15, p = 0.0312, CLES = 0.76, Fig. 3D) when the virtual environment was on compared to off, further indicating DS. Taken together, our data strongly suggest that all males sang DS to their virtual mates.
In summary, call timing between individuals was comparable to that of freely communicating animals and males sang DS to their virtual mates, which showed that our virtual environment provided a sufficiently natural environment for normal behavioural responses in songbirds.
A final crucial component in the design of a closed-loop system is the ability to manipulate an agent's environment and thereby drive its behaviour within its perceptual real-time. When zebra finch males sing DS to a female, they typically habituate to its presence, which leads to a reduction in the number of motifs per minute 44 . In experiments requiring extended periods of DS and/or a high number of motifs, this effect is typically countered by introducing novel females to reinvigorate the male 44,46 . Here we aimed to drive DS behaviour by presenting different virtual females based on the measured song performance of the male.
Driving song behaviour based on song performance requires detection of the stereotyped syllable sequence, i.e., the motif. Therefore we developed a novel, unsupervised motif detector (see "Methods" section). The detector is based on dimensionality reduction of feature vectors generated from the spectrogram of sound segments ( Fig. 4A-E). Training of the model was based on 60,000 feature vectors per animal, each representing a segment of sound. The feature vectors were embedded in a 2D space using t-distributed Stochastic Neighbor Embedding 47 and the watershed transform 48 was used to cluster the space into a behaviour map 49 . Next, we computed the transition probability matrix between all syllables in the training data ( Fig. 4H) and used it to detect the most stereotyped sequence of syllables by starting at the globally most likely transition and following the path of locally most likely transitions (Fig. 4H). In all the males, this path contained a cycle that we defined as the motif of the individual and it was confirmed by experienced observers (IA, CPHE) to be the correct motif.
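The path-following step can be illustrated with a short Python sketch: start at the globally most likely transition, repeatedly follow the most likely outgoing transition, and stop when a syllable is revisited. The revisited part of the path is the cycle. This is a sketch under stated assumptions only: the nested-dictionary representation of the transition matrix, the function name `find_motif` and the example probabilities are hypothetical and do not come from the paper.

```python
def find_motif(transitions):
    """Return the cycle reached by greedy path-following.

    `transitions` maps each syllable label to a dict of
    {next_label: probability} (a hypothetical representation of the
    transition probability matrix)."""
    # Globally most likely transition (a -> b).
    start, nxt = max(
        ((a, b) for a in transitions for b in transitions[a]),
        key=lambda ab: transitions[ab[0]][ab[1]],
    )
    path = [start]
    current = nxt
    while current not in path:
        path.append(current)
        outgoing = transitions.get(current)
        if not outgoing:
            return []  # dead end: no cycle on this path
        # Locally most likely transition from the current syllable.
        current = max(outgoing, key=outgoing.get)
    # The cycle starts where the path first visited `current`.
    return path[path.index(current):]


# Illustrative probabilities: "i" is an introductory note, A-B-C the motif.
example = {
    "i": {"i": 0.6, "A": 0.4},
    "A": {"B": 0.95, "i": 0.05},
    "B": {"C": 0.9, "A": 0.1},
    "C": {"A": 0.7, "i": 0.3},
}
motif = find_motif(example)
```

Here the most likely transition overall is A to B, and following the locally most likely transitions returns to A after C, so the sketch reports the cycle A, B, C.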
Next, we extended the method to detect motifs in real time (Fig. 5A). We detected syllable events by analysing the audio stream and post-embedding the sound segments into the previously computed 2D space (Fig. 5B). Syllable events were then collected into sequence events that were screened for ordered subsets of the motif (see "Methods" section) to create the motif event (see example in Fig. 5C). The entire process was parallelised to achieve real-time detection.
We exposed males to 1 min audio-visual recordings of one female in an excited state from the DS experiments, as long as motifs were detected (see "Methods" section). After 3 min without motif detection, we switched to the audio-visual recordings of another female. When we drove the behaviour over 2 h, males sang significantly more motifs (Wilcoxon signed rank test, 5% significance level, n = 8, W = 33, p = 0.0195, CLES = 0.625) compared to the control US period (Fig. 6A,B). Because we did not randomize the experimental sequence, we cannot exclude that treatment ordering contributed to the observed effect. To confirm that the birds sang DS motifs, we computed the CV of FF in a motif syllable containing a harmonic stack. The CV tended to be lower during the driving period compared to the US control period, which suggests that the males sang DS to the virtual females (Fig. 6C). Taken together, our system made the birds sing (i) directed motifs and (ii) more of them in 2 h compared to undirected motifs in the control period, thus demonstrating the ability to drive directed singing behaviour.

Discussion
We present an experimental environment to study social communication that allows multi-modal, multi-agent interaction. Our goal was first to test whether the audiovisual environment was sufficiently realistic to elicit established behavioural responses in social communication, before using the system in biological experiments. Zebra finches communicating within the modular virtual environment emitted calls that were synchronized and contingent on the call of the mate, with response latencies as in real-life situations 10,26,27 . Furthermore, our data show that males exhibited high-intensity courtship behaviour and sang directed song to their virtual females. To detect DS events, we used an easily implemented definition of DS as song that occurs when the birds have visual access to each other. Previous studies also defined DS as song produced when the male was oriented towards a conspecific female 34 , but did not confirm this classification by further acoustic analysis such as decreased DS motif duration 44 or decreased variation in the fundamental frequency of harmonic stacks in DS syllables 45 . Both behavioural and acoustic analyses support that song elicited under our definition was indeed DS. Taken together, these data suggest that the virtual environment was sufficiently realistic to elicit essential behavioural responses and courtship behaviours from both sexes. Furthermore, by switching videos of excited females we could keep males singing DS. Whether the increased amount of song was driven by video switching we cannot answer, as this would require additional controls. Dissecting out the sensory and behavioural cues that trigger DS in a male and the escalation of courtship intensity are a few of the many interesting biological questions that will require complicated image and sound manipulation to answer. This would be a major advance to which we hope our system will be able to contribute in the future.
We designed and implemented a syllable-based unsupervised audio classifier, which we think will be widely applicable in bioacoustics. Unsupervised clustering methods have been used in the analysis of vocalisations 50 but are typically based on a few dozen acoustic features. A more data-driven approach is to use spectrograms directly as high-dimensional features 51 , which however imposes extensive computational costs. Here, we compressed the spectrograms to arrive at a feature vector of manageable size, thereby keeping computational costs low. Especially for stereotyped behaviours, unsupervised methods like t-SNE 47 excel because clusters of repeated behaviours stand out from noise 49 . Furthermore, we parallelized a variation of the post-embedding algorithm described in 49 to achieve real-time classification. Lastly, we could determine each individual's motif in an unsupervised way by assuming only that the motif is the most repeated syllable string, thus exploiting the fact that zebra finch song is highly stereotyped. Our unsupervised method eliminates the need for annotated training data and thereby reduces labour investment and the risk of experimenter bias.
We acknowledge that our system latencies are rather large and can be optimized, though we note that the zebra finches displayed normal behavioural responses. There are three causes for these latencies. First, any modern video system copies a captured image as digital data from the capture hardware to the render hardware, and this can only be avoided by custom design. Our choice of commodity components imposes a minimal latency because this unavoidable transfer time depends on the bus speed (and therefore the cost) of the computers used. It is always at least 1 frame period (17 ms). Second, given our architectural decisions to decouple and modularize, the data had to be transferred via the network, which again imposed a (cost-related) latency. The raw data rate of the camera is about 74 Mbit/s, necessitating gigabit networking and a delay of around 10 ms to transmit a frame. Third, jitter in the network travel time requires a jitter buffer at the rendering system to ensure uninterrupted, timely presentation of frames. In all, this implies a minimal latency of about 60-80 ms for our system. The system latency can be reduced with faster, more expensive components.
Our setup represents a first step in closed-loop automatic behaviour annotation and animal-computer interaction using higher level behaviours. We used events to represent behaviour, and event processing and microservices to achieve near real-time capabilities 40 . Several studies have demonstrated the power of near real-time processing in closed-loop assays in neuroscience 52,53 , to manipulate pitch in songbirds 54-56 or to provide virtual reality 18-20 . Those studies take advantage of computationally attractive features, such as action potentials or acoustic features, to allow near real-time system response.
We acknowledge that the implementation of our system architecture is slower, but we chose to devote engineering effort and resources to automating the analysis of higher levels of behavioural organisation, which required a different architecture and a higher computational load than previous works, rather than to latency optimization. One interesting future application of our system would be to systematically alter the latency between modality streams to study how tolerance for delay affects the dynamics of vocal exchanges.
Our system is modular and can be extended to multiple setups or to add more sensors, actuators, and computational units. We expect our modular setup to be applicable to other species of social birds and mammals. For example the sample rate of the audio interface can easily be increased to allow ultrasonic vocal communication in rodents. We deliberately based the setup on cheap distributed computers, free and open-source software, and cloud computing to ease the reuse of hardware and software modules in other projects, make it easier for multiple developers to contribute, and make the implementation more widely accessible. The distributed architecture complicates the system and increases the minimum latency, but allows it to scale linearly, makes it easier to maintain, and makes it resilient to single node failures. The system can be deployed anywhere with network and thereby enables global-scale social communication experiments. Setups situated around the globe could thus be connected and allow for unique long-term communication experiments between labs that are physically far apart.

Methods
The virtual environment. The virtual environment was built on the teleprompter principle, where a slanted one-way mirror allows the camera to record the bird through the mirror while the bird sees the reflection of a screen below (Fig. 1A).
A microphone is placed outside the cage above the perch in front of the screen while the rest of the system is placed inside a painted wooden box placed in the cage. The one-way mirror is constructed from a sheet of 3 mm transparent acrylic plexiglass coated by 0.02 mm silver one-way film with 70% light admittance and 99% reflectance.
The screens are trichromatic 7″ LCD displays with 800 × 480 pixel resolution. Although birds possess at least tetrachromatic or even pentachromatic vision 57 , previous studies showed that males sing when presented with live video of conspecific females on trichromatic screens 32,34,35 . However, critical to eliciting courtship behaviour was the use of 100 Hz screens 32,35 that are above the flicker-fusion frequency of birds 57,58 or non-flickering liquid-crystal displays (LCD). Therefore, we decided to use 60 Hz LCD screens that present slower, but continuous, flicker-free images to the birds. The video is recorded with a Raspberry Pi Camera V2 in 800 × 480 pixel resolution at 60 frames per second (fps) and streamed to the network from a Raspberry Pi 3. The video delay was measured by simultaneously turning on an LED in both boxes and recording a video with an external camera showing both the local LED and the screen. The delay was calculated by counting the number of frames between the moment the local LED turned on and the moment the remote LED from the other box became visible on the screen. This delay was measured to be 383 ms (23 frames at 60 fps).
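As a worked check of the frame-count conversion above, the measured delay follows directly from the frame count and the frame rate. The symbols $N$ (counted frames), $f$ (frame rate) and $t_{\text{video}}$ are introduced here only for illustration, and the formula assumes that the frame rate given in the text is the one at which frames were counted.

```latex
t_{\text{video}} = \frac{N}{f} = \frac{23\ \text{frames}}{60\ \text{frames/s}} \approx 0.383\ \text{s} = 383\ \text{ms},
\qquad
\frac{1}{f} = \frac{1}{60\ \text{s}^{-1}} \approx 16.7\ \text{ms}
```

The second expression is the duration of a single frame, which bounds the resolution of a measurement made by counting frames.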
The audio playback comes from a 1.5 W mini-speaker placed behind the mirror. While originally playing sound at the received level of the microphone in the other box, we slightly attenuated this signal (−6 dB) to avoid acoustic feedback. Because a 6 dB attenuation corresponds to a doubling of the distance to the source 59 , the bird in box 1 would perceive the source level of the animal in box 2 as if it were twice as far away as it was. However, because the source level of zebra finches in the lab varies from 50 to 70 dB re 20 µPa at 1 m 60 , a 6 dB attenuation is well within the natural range of variation. Since we form a loop between the two cages, a bird could hear a delayed and strongly attenuated version of its own vocalisation, similar to hearing an echo. We measured the total attenuation to vary between 15 and 30 dB.
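The correspondence between 6 dB and a doubling of distance can be verified with a short calculation, assuming free-field spherical spreading, where sound pressure falls off inversely with distance. That assumption and the symbols $\Delta L$, $r_1$ and $r_2$ are introduced here for illustration and are not stated in this form in the text.

```latex
\Delta L = 20 \log_{10}\!\left(\frac{r_2}{r_1}\right)\ \text{dB},
\qquad
r_2 = 2 r_1 \;\Rightarrow\; \Delta L = 20 \log_{10}(2) \approx 6.02\ \text{dB}
```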
The sound was recorded and streamed from a multi-channel recording array 61 using Knowles FG23329-PO7 microphones. The recording equipment is not part of the developed virtual environment and it could be replaced by any system capable of streaming audio. The audio delay was measured by making a loud sound (with a clicker) in one box and timing the difference between that signal in one box and the version played back in the other box. This delay was measured to be 308 ms ± 4 ms. The audio to video leading delay was thus 75 ms. This delay is about two times below the 180 ± 42 ms delay to perceive lip-syncing errors in humans 62 , but we are not aware of any comparative data in birds. Figure 1C shows the architecture of the system. Each module contains two Raspberry Pi 3 model B connected to a gigabit switch. One is connected to the camera and is only responsible for streaming video. The other, connected to display and speaker, is responsible for playback of sound and image. A third Raspberry Pi 3 is placed on top of the cage responsible for polling the perch sensor at 20 Hz and emitting state changes as events. It also measures temperature and humidity in the box and emits those as events every minute. The multi-channel microphone array is placed outside the isolator box with a microphone placed in each cage. All computers on the network are synchronised to within milliseconds using the Network Time Protocol 63 implemented with chrony 64 .
Data is streamed to IPv6 multi-cast groups following the publish-subscribe pattern 65 . A PC in the bird room acts as historian, saving the data streams. Data is offloaded to a Ceph 66 persistent storage cluster placed in our data centre. Several event processors continuously analyse the data streams, producing events. These are running in a docker swarm 67 cluster also in our data centre.
The two-layered architecture is based on data streams and event streams 40 . An event is an association between a specific time and a specific property, in this case a behaviour. A data stream contains sampled data from sensors such as cameras and microphones while an event stream consists of events produced by data stream processors or by asynchronous sensors such as contacts.
The continuous audio stream is analysed in real time to produce the power and entropy events. This analysis is based on estimating the power by squaring the sample values and the entropy as the ratio of the geometric mean to the arithmetic mean. The analysis is implemented as plug-ins to the media-streaming framework gstreamer 68 . The estimates are thresholded with hysteresis 40 and published as kafka events 69 . Hysteresis was used to prevent so-called switch bouncing between on and off when the input signal was close to the noise level 70 . A perch sensor installed in the cage directly generates a perch event every time the bird changes location in the cage. Based on those three events, three complex events are generated, namely bout, contact and directed song. The bout event is active when both power and entropy events are active, and the contact event is active when both birds are perched in front of the screen. The directed song event is active when the bout and contact events are active (Fig. 1D). Event processing was implemented as microservices in docker containers 67 for high modularity and was running in our data centre.
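The effect of thresholding with hysteresis can be sketched in a few lines of Python. The sketch below is illustrative only and is not the authors' gstreamer plug-in. The function names, the threshold values and the example data are hypothetical, and `flatness` assumes that the geometric-to-arithmetic mean ratio is taken over strictly positive values such as a power spectrum, which the text does not specify.

```python
import math


def frame_power(samples):
    """Mean of the squared sample values of one analysis frame."""
    return sum(s * s for s in samples) / len(samples)


def flatness(values):
    """Ratio of geometric mean to arithmetic mean of strictly positive
    values (assumed here to be a power spectrum)."""
    n = len(values)
    geometric = math.exp(sum(math.log(v) for v in values) / n)
    arithmetic = sum(values) / n
    return geometric / arithmetic


def hysteresis_events(estimates, t_on, t_off):
    """Turn a sequence of estimates into onset/offset events.

    The state switches on when an estimate rises above `t_on` and
    switches off only when it falls below the lower `t_off`, so values
    hovering around a single threshold do not cause bouncing."""
    events = []
    active = False
    for index, value in enumerate(estimates):
        if not active and value > t_on:
            active = True
            events.append((index, "onset"))
        elif active and value < t_off:
            active = False
            events.append((index, "offset"))
    return events


# Hypothetical estimates that hover around 0.5 before dropping.
estimates = [0.1, 0.6, 0.45, 0.55, 0.2, 0.1]
events = hysteresis_events(estimates, t_on=0.5, t_off=0.3)
```

With a single threshold at 0.5 the example sequence would switch state four times. With the two thresholds it produces one onset and one offset.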

Animals and husbandry.
Adult male and female zebra finches (Taeniopygia guttata) were kept pairwise in breeding cages at the University of Southern Denmark, Odense, Denmark on a 12 h light:dark photoperiod and given water and food ad libitum. All experiments were conducted in accordance with the Danish law concerning animal experiments and protocols were approved by the Danish national ethics committee, the Danish Animal Experiments Inspectorate (Copenhagen, Denmark). The reporting in the manuscript follows the recommendations in the ARRIVE guidelines.
We used adult zebra finches (> 100 days post hatch) that were established breeding pairs (meaning that they had produced at least one clutch of offspring together) in the animal-animal communication experiments, and additionally single males for the animal-computer experiments. All animals were naive to the experimental conditions. When not in an experiment, the birds were kept pairwise in breeding cages or in aviaries containing 200-300 individuals. During experiments, the birds were isolated in sound-attenuated boxes for a maximum of ten days before returning to their usual surroundings. The birds had access to food and fresh water ad libitum, served at the bottom of the cage and from feeders at the side of the cage. In the virtual environment, the birds were kept on a 12 h light:dark photoperiod. The temperature was kept between 22 and 28 °C and the relative humidity at 50-60%. The temperature difference between the position in front of the screen (front perch) and behind the blind (back perch) was measured with the system fully on to be 0.4 °C (±0.3% accuracy). The isolator boxes attenuated sounds in the 200-8000 Hz range by 40 dB (measured by playing back sound in the isolator and recording sound levels both inside and outside the box). A fan ensured air flow in the box and provided cooling for the equipment located inside.

Sound segmentation.
Segmentation was based on the silence between syllables and was calculated from the amplitude of the signal (Fig. 4B). We used two threshold values applied on two different metrics for discriminating between sound and silence. The input signal (Fig. 4A) was normalised to the range [−1, 1] but otherwise not pre-processed. The high threshold was 0.5 and the metric that had to pass this threshold was the absolute value of the signal. The low threshold was also 0.5 but the metric that had to go below this threshold was the peak-to-peak amplitude over 325 samples. Both threshold values were determined experimentally based on initial recordings. Starting from the $i_{\text{on}}$-th sample, where the absolute value of the sample $s_i$ surpasses the on-threshold $t_{\text{on}}$ (0.5), that is, $|s_i| > t_{\text{on}}$, we searched backwards in time to find the onset sample number $i_{\text{onset}}$, defined as the sample where the peak-to-peak amplitude over $w$ samples (325) was below the off-threshold $t_{\text{off}}$ (0.5).
Similarly, we searched for the offset sample number $i_{\text{offset}}$ as new samples arrived.
As soon as the last sample was received the segment was passed on to the next stage of the pipeline. Segments shorter than 30 ms or longer than 300 ms were discarded since the duration of zebra finch syllables is expected to be within that range. We implemented both an online and offline version of this segmentation algorithm and used it for all the experiments presented in this paper.
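The two-threshold segmentation can be sketched as an offline Python function. This is a minimal sketch under stated assumptions rather than the authors' code. The window placement (the `w` samples before a candidate onset and the `w` samples after a candidate offset), the sample rate of 48 kHz used in the example, and all function and variable names are hypothetical. The thresholds, window length and duration limits are taken from the text.

```python
def peak_to_peak(signal, lo, hi):
    """Peak-to-peak amplitude of signal[lo:hi], clipped to the signal."""
    window = signal[max(0, lo):min(len(signal), hi)]
    return max(window) - min(window)


def segment(signal, rate, t_on=0.5, t_off=0.5, w=325,
            min_dur=0.030, max_dur=0.300):
    """Return (onset, offset) sample indices of detected segments in a
    signal normalised to [-1, 1]."""
    segments = []
    n = len(signal)
    i = 0
    while i < n:
        if abs(signal[i]) > t_on:
            # Search backwards until the preceding w samples are quiet.
            onset = i
            while onset > 0 and peak_to_peak(signal, onset - w, onset) >= t_off:
                onset -= 1
            # Search forwards until the following w samples are quiet.
            offset = i
            while offset < n - 1 and peak_to_peak(signal, offset, offset + w) >= t_off:
                offset += 1
            duration = (offset - onset) / rate
            # Keep only segments of plausible syllable duration.
            if min_dur <= duration <= max_dur:
                segments.append((onset, offset))
            i = offset + 1
        else:
            i += 1
    return segments


# Synthetic example: silence, a 2000-sample burst, silence.
RATE = 48000  # assumed sample rate, not stated in this section
tone = [0.0] * 1000 + [0.8, -0.8] * 1000 + [0.0] * 1000
segments = segment(tone, RATE)
```

The burst lasts about 42 ms at the assumed rate and is kept, whereas a burst of only 100 samples (about 2 ms) would be discarded by the duration check.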
Perch preference protocol. The same data was used for all the animal-animal communication experiments. The virtual environment was powered down for at least 2 h before the birds were moved to the isolator box (day 0) and left off for at least 24 h before it was turned on for another full day (day 1). Experiments ran on the following days, starting when the cage lights were turned on and lasting for 2 h thereafter. Day 2 was always with black screen and no sound, and on the following days the system cycled through combinations of two speaker states (off, on) and two screen states (off, on) in randomised order. After the experiments the birds were returned to their home cages. To investigate the motivation for using the virtual environment, we looked at the perch preference in different states of the system. Based on the perch sensor, an event was emitted every time the bird changed position in the cage, and summing the duration of the events gave a measure of the proportion of time spent in each position (Fig. 1E).
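Summing event durations into proportions of time can be sketched as follows. The sketch assumes that each perch event is a (timestamp, position) pair and that a position lasts until the next event or the end of the session. The function name, the position labels and the example timestamps are hypothetical.

```python
from collections import defaultdict


def time_proportions(events, end_time):
    """Proportion of the session spent in each position.

    `events` is a time-sorted list of (timestamp_s, position) pairs,
    one per position change; `end_time` closes the last event."""
    totals = defaultdict(float)
    closing = events[1:] + [(end_time, None)]
    for (start, position), (stop, _) in zip(events, closing):
        totals[position] += stop - start
    span = end_time - events[0][0]
    return {position: total / span for position, total in totals.items()}


# Hypothetical 2 h session (7200 s).
events = [(0.0, "back"), (600.0, "front"), (4200.0, "other"), (4800.0, "front")]
proportions = time_proportions(events, end_time=7200.0)
```

In this example the bird spends 6000 of 7200 s on the front perch, so about 83% of the session.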
Call-timing protocol. The audio was segmented as described above and combined into one big dataset covering 12 h a day for all 12 pairs. To provide training data, we then hand-annotated, for each bird, the first 30 min with both video and audio on. To ease annotation, we used pre-clustering based on cross-correlation, so the observer was presented with oscillograms, spectrograms and sound from 1 min at a time that had already been clustered into groups of sounds with a high cross-correlation maximum. To prevent false negatives, we used a very low threshold to ensure we detected all sounds. The observer then had to name the groups and correct mistakes made by the pre-clustering algorithm. The classes found were song, "stack" calls, distance calls, echo (loud sounds from the other bird triggering the segmentation), wing flapping and noise. Classification of vocalisations followed the descriptions in 25 . We did not observe "tet" calls. Based on the annotations, a random forest classifier 71 with 100 estimators was trained for each bird, ranging in mean accuracy (10% hold-out) from 0.82 to 0.96. To investigate call timing, we measured the time difference from the playback of a stack call (onset + delay) to the next stack call emitted by the animal of interest, up to a maximum of 2 s. Histograms of the time differences were constructed (200 bins) and plotted with Gaussian kernel density estimates (KDE, bandwidth = 100, Fig. 2B).
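The call-timing measure can be sketched in Python as follows: each stack call of the mate is shifted by the playback delay, and the latency to the next stack call of the focal bird is kept if it is at most 2 s. The function name, the use of seconds as the unit and the example onset times are hypothetical. The 308 ms delay and the 2 s limit are taken from the text.

```python
import bisect


def response_latencies(mate_onsets, own_onsets, delay, max_latency=2.0):
    """Latency from playback of each mate call (onset + delay) to the
    next call of the focal bird, dropping latencies above max_latency.
    All times are in seconds."""
    own = sorted(own_onsets)
    latencies = []
    for onset in mate_onsets:
        playback = onset + delay
        # Index of the first own call strictly after playback.
        k = bisect.bisect_right(own, playback)
        if k < len(own) and own[k] - playback <= max_latency:
            latencies.append(own[k] - playback)
    return latencies


# Hypothetical stack-call onsets.
mate_calls = [1.0, 5.0, 10.0]
own_calls = [1.6, 9.0, 10.5]
latencies = response_latencies(mate_calls, own_calls, delay=0.308)
```

The second mate call has no response within 2 s and therefore contributes no latency to the histogram.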

Directed song protocol.
To confirm directed song in the animal-animal communication experiments, we selected videos with potential female directed song based on the definition that both birds were on the front perch and the male was vocalising. The videos were then scored by experienced observers (IA, CPHE) for the display of hopping, jumping, beak wiping, looking at the mate and fluffing plumage (see Fig. 3D for examples).
As a quantitative measure, we calculated the coefficient of variation (CV) of the fundamental frequency, which is lower in DS compared to US 45 . However, this measure is extracted from stack syllables without frequency modulation, such as the one shown in Fig. 3B. The motifs of five birds contained a suitable syllable. From spectrograms of the motifs identified by the motif detector, we manually selected the same place in the stack syllable (Fig. 3B) in 30 motifs from each individual. The fundamental frequency (FF) was estimated from 2048 samples (~42 ms) using the YIN algorithm 72 . Because the reduction of CV of FF during directed song has only been established during harmonic stacks 45 , one male that did not have a harmonic stack in its motif was excluded from this analysis.
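The coefficient of variation is the standard deviation of the FF estimates divided by their mean. A small sketch with made-up FF values (illustrative only, not measured data); using the sample standard deviation is an assumption:

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical FF estimates in Hz from the same place in a stack syllable.
example_ff = [600.0, 610.0, 590.0]
example_cv = coefficient_of_variation(example_ff)
```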

Animal-computer communication experiments.
For the driving experiments, the male was left to habituate to the new surroundings until he produced at least 10 motif repetitions during the first 2 h after lights on (day 0). On the following day (day 1) we ran the driving experiment, meaning that videos were displayed showing excited females. Three videos of different females were used, each 1 min long, taken from the animal-animal communication experiments. In the case of pair-bonded males, the established mates of the focal animals were not among the female videos. In the case of single males, the 60,000 training samples were recorded over three days prior to the experiment. The logic of the system was that every time a motif was detected, a timer was reset. If the timer ran out (3 min since the last motif), the next video was displayed; otherwise the same video kept looping. On day 2 we recorded the control without video playbacks.
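The timer logic can be sketched as a small state machine. The class and method names are hypothetical, and two details are assumptions not stated above: the timer restarts when the video changes, and the playlist wraps around after the last video.

```python
class VideoDriver:
    """Chooses which video to loop, based on the time since the last motif."""

    def __init__(self, videos, timeout_s=180.0, start_time=0.0):
        self.videos = list(videos)
        self.timeout_s = timeout_s      # 3 min without a motif
        self.index = 0
        self.last_reset = start_time

    def on_motif(self, t):
        """A detected motif resets the timer."""
        self.last_reset = t

    def current_video(self, t):
        """Video to play at time t; advances if the timer has run out."""
        if t - self.last_reset >= self.timeout_s:
            self.index = (self.index + 1) % len(self.videos)  # assumption: wrap around
            self.last_reset = t                               # assumption: restart timer
        return self.videos[self.index]

driver = VideoDriver(["female_a", "female_b", "female_c"])
```

As long as the male keeps singing, `on_motif` keeps pushing the deadline forward and the same video continues to loop.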
Motif detector. We trained one motif detector per individual. The feature extraction is based on summing the rows and columns of the spectrogram (Fig. 4D) and concatenating them to form a feature vector (Fig. 4E). First, a spectrogram of the segment is formed by applying the short-time Fourier transform (STFT) with an FFT size of 1440 and a stride of 25 samples. The parameters are all based on the sampling frequency of 48 kHz, the duration of sounds (30-300 ms) and the desired number of time bins. The smallest spectrogram we can make has just one time bin, and thus the maximum FFT size is set by the shortest sound: $N_{\mathrm{FFT}} = 48\,\mathrm{kHz} \times 30\,\mathrm{ms} = 1440$ samples. The stride parameter can then be calculated from the longest sound and the number of time bins: $(48\,\mathrm{kHz} \times 300\,\mathrm{ms} - 1440)/512 \approx 25$ samples. The spectrogram is cropped to the approximate audible range for zebra finches, 200 Hz to 8 kHz (234 bins), and the time dimension is cropped to the first 512 time bins, corresponding to 300 ms (zero-padded if the segment duration is shorter). The rows and columns of the spectrogram are summed, and the two resulting vectors $F_t$ and $F_f$ are concatenated to form a 746-dimensional feature vector $F$ (Fig. 4E).
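The feature extraction can be sketched with NumPy as follows. The FFT size, stride, frequency range and vector lengths are taken from the text; the Hann window, the use of magnitude rather than power, and the handling of segments shorter than one frame are assumptions. Each of the two vectors is scaled to sum to one before concatenation.

```python
import numpy as np

FS = 48_000                       # sampling frequency (Hz)
N_FFT = 1440                      # FFT size: 30 ms at 48 kHz
STRIDE = 25                       # hop between frames (samples)
N_TIME = 512                      # time bins kept (about 300 ms)
K_LO = round(200 * N_FFT / FS)    # first kept frequency bin (200 Hz)
K_HI = round(8000 * N_FFT / FS)   # first dropped frequency bin (8 kHz)

def feature_vector(segment):
    """746-dimensional feature vector from one sound segment."""
    x = np.asarray(segment, dtype=float)
    if len(x) < N_FFT:  # assumption: pad segments shorter than one frame
        x = np.pad(x, (0, N_FFT - len(x)), mode="constant")
    n_frames = min(N_TIME, 1 + (len(x) - N_FFT) // STRIDE)
    idx = STRIDE * np.arange(n_frames)[:, None] + np.arange(N_FFT)[None, :]
    frames = x[idx] * np.hanning(N_FFT)  # assumption: Hann window
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, K_LO:K_HI]
    # Zero-pad the time axis up to N_TIME bins for short segments.
    spec = np.pad(spec, ((0, N_TIME - n_frames), (0, 0)), mode="constant")
    f_t = spec.sum(axis=1)  # one value per time bin (512)
    f_f = spec.sum(axis=0)  # one value per frequency bin (234)
    if f_t.sum() > 0:
        f_t = f_t / f_t.sum()
    if f_f.sum() > 0:
        f_f = f_f / f_f.sum()
    return np.concatenate([f_t, f_f])

# Illustrative input: a 100 ms pure tone at 3 kHz.
example_tone = np.sin(2 * np.pi * 3000 * np.arange(4800) / FS)
example_feature = feature_vector(example_tone)
```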
Each vector is normalised to have a sum of one before concatenation. A training set is created consisting of 60,000 feature vectors from the same individual, each representing one sound segment. We embed each of these high-dimensional points in a two-dimensional space (Fig. 4F) using the t-SNE method introduced in 47 . The method minimises the relative entropy between two distributions, one representing the high-dimensional points and one representing the low-dimensional points, so that close points in the high-dimensional space are also close in the low-dimensional space.

Since we are interested in stereotyped behaviour, we then followed the method described in Berman et al. 49 , placing a Gaussian kernel (bandwidth = 15) on each embedded point. We then generated a density plot (Fig. 4G) and detected all peaks that are separated by a distance of 15 or more. Using the watershed algorithm 48 on the inverted density plot, we get a set of clusters, each roughly representing a stereotyped syllable. Examples from three different regions can be seen in Fig. 4G. The further a point is from the peak of its region, the more likely it is to be misclassified, and thus the distance from the peak could be used to indicate the certainty of the classification. We found that some regions represent a merge of two syllables, while some represent part of a split syllable. For higher accuracy in detecting the syllables, this information could be used for post-processing, or better means of segmentation could be introduced.
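The density estimate and peak detection can be illustrated with NumPy alone. This is a simplified sketch under stated assumptions: the density is evaluated on a regular grid of hypothetical size, peaks are grid cells that are local maxima, and points are assigned to the nearest peak as a stand-in for the watershed segmentation, which is not implemented here.

```python
import numpy as np

def density_map(points, bandwidth=15.0, grid_size=100, margin=3.0):
    """Sum of Gaussian kernels, one per embedded point, on a regular grid."""
    pts = np.asarray(points, dtype=float)
    lo = pts.min(axis=0) - margin * bandwidth
    hi = pts.max(axis=0) + margin * bandwidth
    xs = np.linspace(lo[0], hi[0], grid_size)
    ys = np.linspace(lo[1], hi[1], grid_size)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    density = np.zeros_like(gx)
    for px, py in pts:
        density += np.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2 * bandwidth ** 2))
    return xs, ys, density

def find_peaks(xs, ys, density, min_distance=15.0):
    """Local maxima of the density, kept only if at least min_distance apart."""
    n0, n1 = density.shape
    padded = np.pad(density, 1, mode="constant", constant_values=-np.inf)
    is_max = np.ones_like(density, dtype=bool)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx or dy:
                is_max &= density >= padded[1 + dx:1 + dx + n0, 1 + dy:1 + dy + n1]
    # Visit candidates from highest to lowest density.
    candidates = sorted(zip(*np.nonzero(is_max)), key=lambda ij: -density[ij])
    peaks = []
    for i, j in candidates:
        p = np.array([xs[i], ys[j]])
        if all(np.linalg.norm(p - q) >= min_distance for q in peaks):
            peaks.append(p)
    return np.array(peaks)

def assign_regions(points, peaks):
    """Nearest-peak labels (simplification; the text uses a watershed)."""
    pts = np.asarray(points, dtype=float)
    return np.argmin(np.linalg.norm(pts[:, None, :] - peaks[None, :, :], axis=2), axis=1)

# Illustrative embedded points forming two clusters (not real data).
example_points = [(0, 0), (2, 1), (-1, 2), (1, -2), (100, 0), (101, 2), (99, -1), (102, 1)]
example_peaks = find_peaks(*density_map(example_points))
example_labels = assign_regions(example_points, example_peaks)
```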
After the training phase, a new data point z is embedded based on the already embedded points, largely using the method described in 49 (appendix D). The perplexity parameter of the t-SNE algorithm can be interpreted as a measure of the number of nearest neighbours 47 , and therefore we only consider the 'perplexity' nearest points X in the high-dimensional space, found using the ball tree algorithm 73 .
We then choose an embedding $z'$ of the new point $z$ such that the conditional probabilities in the low-dimensional space, $q_{j|z'}$, are similar to those in the high-dimensional space, $p_{j|z}$. The conditional probability $p_{j|z}$ of a point $x_j \in X$ given the new point $z$ is defined in terms of $X$, the vector of nearest points in the high-dimensional space; $z$, the new point in the high-dimensional space; $\sigma$, found by a binary search for the value that produces a conditional probability distribution with the perplexity set by the user; and $D_{KL}$, the relative entropy. The conditional probability $q_{j|z'}$ of a point $x'_j$ in the low-dimensional embedding $X'$ given the new embedding $z'$ is defined in terms of the Euclidean distance between points in the embedding. Since $z'$ is the only unknown, we can find it by minimising the relative entropy $D_{KL}$ between the two conditional probability distributions, using the Nelder-Mead simplex algorithm 74 with the centroid of the embedding $X'$ of the nearest points $X$ as the starting guess. If the starting guess is not in the basin of attraction of the global minimum, the new point is not like any of the points presented during training, and the embedded point will shoot towards infinity 49 .

One instance of the segmentation algorithm ran for each of the two audio channels used in the experiment, and a new feature vector was formed for each detected segment and placed in a queue. A pool of 6 workers (containers running in the cluster) processed feature vectors from the queue using the post-embedding algorithm described above and emitted events containing the onset, the bird ID, the low-dimensional point, a letter representing the region it belonged to, and the latency measured from the end of the segment until the event was emitted. The median latency over 3 million syllables was 1.089 s (percentiles: 5th = 0.356, 25th = 0.945, 75th = 1.304, 95th = 2.500). We found that 95% of the segments were classified and the remaining 5% were marked as unclassified.
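The embedding of a new point described above can be written out explicitly. The following is a sketch, not a transcription of the original equations: it assumes the standard t-SNE kernels (Gaussian in the high-dimensional space, Student-t in the low-dimensional space) and, following 49 , the relative entropy as the distance in the high-dimensional space. The squared form of the exponent and the symbol $\Delta_{a,b}$ for the Euclidean distance between points $a$ and $b$ are assumptions.

```latex
\begin{align*}
p_{j|z} &= \frac{\exp\!\left(-D_{KL}(z \,\|\, x_j)^2 / 2\sigma^2\right)}
               {\sum_{x_k \in X} \exp\!\left(-D_{KL}(z \,\|\, x_k)^2 / 2\sigma^2\right)},
&
D_{KL}(a \,\|\, b) &= \sum_i a_i \log \frac{a_i}{b_i}, \\
q_{j|z'} &= \frac{\left(1 + \Delta_{z',x'_j}^2\right)^{-1}}
                {\sum_{x'_k \in X'} \left(1 + \Delta_{z',x'_k}^2\right)^{-1}},
&
z' &= \arg\min_{z'} \sum_j p_{j|z} \log \frac{p_{j|z}}{q_{j|z'}}.
\end{align*}
```

Under these assumptions, $\sigma$ is the only free parameter of $p_{j|z}$ and is fixed by the perplexity, so the final minimisation runs over the two coordinates of $z'$ only.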
Sequence events were generated by event-processors in the cluster, based on the timing of the syllable events. If the onset of the next segment was within a window of 0.5 s after the offset of the previous one, it was added to the sequence; otherwise it was assigned to a new sequence. Within 3 s after the end of a sequence, an event was emitted containing the onset, offset, bird ID and the sequence.
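The grouping rule can be sketched as follows, assuming each syllable event is reduced to a hypothetical `(onset, offset, label)` tuple with times in seconds and a one-letter region label:

```python
def group_sequences(syllables, max_gap=0.5):
    """Group syllables into sequences separated by silent gaps longer than max_gap."""
    sequences = []
    for onset, offset, label in syllables:
        if sequences and onset - sequences[-1]["offset"] <= max_gap:
            # Close enough to the previous syllable: extend the current sequence.
            sequences[-1]["labels"] += label
            sequences[-1]["offset"] = offset
        else:
            sequences.append({"onset": onset, "offset": offset, "labels": label})
    return sequences

# Illustrative syllable events only, not recorded data.
example_syllables = [
    (0.00, 0.10, "a"), (0.15, 0.25, "b"), (0.30, 0.40, "c"),
    (1.50, 1.60, "a"), (1.65, 1.75, "b"),
]
example_sequences = group_sequences(example_syllables)
```

In a streaming setting, a sequence can only be closed once the 0.5 s window has passed without a new syllable, which is consistent with the event being emitted some time after the sequence ends.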
To find the motif of the bird, we formed a transition probability matrix based on the training data. Since the transitions in the motif were by far the most frequent, the syllables in the motif already stood out. Because the motif was repeated several times in a bout, its transitions formed cycles in the transition matrix (Fig. 4H). We found the cycle by starting from the globally most frequent transition and following the locally most frequent transitions until getting back to an already visited element. If the motif contained repeated syllables, or if the bird sang many introductory notes, there was a possibility of dead ends, but these could be detected and solved programmatically.
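The cycle search can be sketched as a greedy walk over transition counts. This is an illustration under assumptions: sequences are strings of one-letter syllable labels, raw counts stand in for the transition probabilities, and a dead end simply returns the path found so far rather than the handling used in the experiments.

```python
from collections import Counter

def find_motif_cycle(sequences):
    """Follow the most frequent transitions until a syllable repeats."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return ""
    (first, second), _ = counts.most_common(1)[0]  # globally most frequent transition
    path = [first, second]
    while True:
        current = path[-1]
        outgoing = {dst: n for (src, dst), n in counts.items() if src == current}
        if not outgoing:
            return "".join(path)  # dead end (assumption: return the partial path)
        nxt = max(outgoing, key=outgoing.get)  # locally most frequent transition
        if nxt in path:
            return "".join(path[path.index(nxt):])  # the cycle closes here
        path.append(nxt)

# Illustrative label sequences only; "i" stands for an introductory note.
example_motif = find_motif_cycle(["iiabcdabcdabcd", "iabcdabcd", "abcdabcx"])
```

The returned string is one rotation of the cycle; which rotation depends on where the most frequent transition happens to lie.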
The birds often sing variations of the long motif, so we found the ten most common substrings of the motif and looked for those in the sequence events. We counted the number of occurrences of each substring in the sequence and, if a motif was present, we emitted a motif event (based on the most frequent substring in the sequence) containing the onset, motif, number of occurrences and bird ID. The motif detector was implemented as an event-processor running in the cluster. To verify the motif detector, an observer (LBL) looked at the spectrograms of all the motifs generated during the 2 h of experimentation for one bird (96 motifs) and confirmed that all of them were indeed motifs.
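The substring matching can be sketched as follows. Several details are assumptions made for illustration: substrings are ranked by how often they occur in the training sequences, a hypothetical minimum length `min_len` excludes trivially short substrings, occurrences are counted without overlap, and ties are broken in favour of the longer substring.

```python
from collections import Counter

def common_variants(motif, training_sequences, n=10, min_len=2):
    """The n substrings of the motif that occur most often in the training data."""
    substrings = {motif[i:j]
                  for i in range(len(motif))
                  for j in range(i + min_len, len(motif) + 1)}
    counts = Counter({s: sum(seq.count(s) for seq in training_sequences)
                      for s in substrings})
    return [s for s, _ in counts.most_common(n)]

def detect_motif(sequence, variants):
    """Most frequent variant in a sequence and its count, or None if absent."""
    counts = {v: sequence.count(v) for v in variants}
    best = max(counts, key=lambda v: (counts[v], len(v)))
    return (best, counts[best]) if counts[best] > 0 else None

# Illustrative label sequences only, not recorded data.
example_variants = common_variants("abcd", ["abcdabcd", "abcabcd", "iabcd"], n=3, min_len=3)
example_event = detect_motif("iiabcdabcdabc", example_variants)
```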